22 research outputs found
A hybrid quantum approach to leveraging data from HTML tables
The Web provides many data that are encoded using HTML tables. This facilitates
rendering them, but obfuscates their structure and makes it difficult for automated business
processes to leverage them. This has motivated many authors to work on proposals to
extract them as automatically as possible. In this article, we present a new unsupervised
proposal that uses a hybrid approach in which a standard computer is used to perform pre- and post-processing tasks and a quantum computer is used to perform the core task:
guessing whether the cells have labels or values. The problem is addressed using a
clustering approach that is known to be NP using standard computers, but our proposal can
solve it in polynomial time, which implies a significant performance improvement. It is
novel in that it relies on an entropy-preservation metaphor that has proven to work very
well on two large collections of real-world tables from the Wikipedia and the Dresden Web
Table Corpus. Our experiments prove that our proposal can beat the state-of-the-art
proposal in terms of both effectiveness and efficiency; the key difference is that our
proposal is totally unsupervised, whereas the state-of-the-art proposal is supervised.
Funding: Ministerio de Economía y Competitividad TIN2016-75394-R; Ministerio de Ciencia e Innovación PID2020-112540RB-C44; Junta de Andalucía P18-RT-106
Extracting Web Information using Representation Patterns
Feeding decision support systems with Web information typically
requires sifting through an unwieldy amount of information that is
available in human-friendly formats only. Our focus is on a scalable
proposal to extract information from semi-structured documents
in a structured format, with an emphasis on it being scalable and
open. By semi-structured we mean that it must focus on information that is rendered using regular formats, not free text; by scalable, we mean that the system must require a minimum amount of
human intervention and it must not be targeted to extracting information from a particular domain or web site; by open, we mean
that it must extract as much useful information as possible and not
be subject to any pre-defined data model. In the literature, there is
only one open but not scalable proposal, since it requires human
supervision on a per-domain basis. In this paper, we present a new
proposal that relies on a number of heuristics to identify patterns
that are typically used to represent the information in a web document. Our experimental results confirm that our proposal is very
competitive in terms of effectiveness and efficiency.
Funding: Ministerio de Economía y Competitividad TIN2016-75394-R; Ministerio de Economía y Competitividad TIN2013-40848-
On exploring data lakes by finding compact, isolated clusters
Data engineers are very interested in data lake technologies due to the incredible abundance of datasets. They typically use clustering to understand the structure of the datasets
before applying other methods to infer knowledge from them. This article presents the first
proposal that explores how to use a meta-heuristic to address the problem of multi-way
single-subspace automatic clustering, which is very appropriate in the context of data
lakes. It was confronted with five strong competitors that combine the state-of-the-art
attribute selection proposal with three classical single-way clustering proposals, a recent
quantum-inspired one, and a recent deep-learning one. The evaluation focused on exploring their ability to find compact and isolated clusterings as well as the extent to which such
clusterings can be considered good classifications. The statistical analyses conducted on
the experimental results prove that it ranks first regarding effectiveness using six standard coefficients and it is very efficient in terms of CPU time, not to mention that it did not
result in any degraded clusterings or timeouts. Summing up: this proposal contributes to
the array of techniques that data engineers can use to explore their data lakes.
Funding: Ministerio de Economía y Competitividad TIN2016-75394-R; Ministerio de Ciencia e Innovación PID2020-112540RB-C44; Junta de Andalucía P18-RT-1060; Junta de Andalucía US-138137
On Extracting Data from Tables that are Encoded using HTML
Tables are a common means to display data in human-friendly formats. Many
authors have worked on proposals to extract those data back since this has
many interesting applications. In this article, we summarise and compare many
of the proposals to extract data from tables that are encoded using HTML and
have been published between 2000 and 2018. We first present a vocabulary that
homogenises the terminology used in this field; next, we use it to summarise
the proposals; finally, we compare them side by side. Our analysis highlights
several challenges to which no proposal provides a conclusive solution and a
few more that have not been addressed sufficiently; simply put, no proposal
provides a complete solution to the problem, which seems to suggest that this
research field will remain active in the near future. We have also realised that
there is no consensus regarding the datasets and the methods used to evaluate
the proposals, which hampers comparing the experimental results.
Funding: Ministerio de Economía y Competitividad TIN2013-40848-R; Ministerio de Economía y Competitividad TIN2016-75394-
A clustering approach to extract data from HTML tables
HTML tables have become pervasive on the Web. Extracting their data automatically is difficult
because finding the relationships between their cells is not trivial due to the many different
layouts, encodings, and formats available. In this article, we introduce Melva, which is an
unsupervised domain-agnostic proposal to extract data from HTML tables without requiring any
external knowledge bases. It relies on a clustering approach that helps tell label cells apart
from value cells and establish their relationships. We compared Melva to four competitors on
more than 3 000 HTML tables from the Wikipedia and the Dresden Web Table Corpus. The
conclusion is that our proposal is 21.70% better than the best unsupervised competitor and
equals the best supervised competitor regarding effectiveness, but it is 99.14% better regarding
efficiency.
Funding: Ministerio de Ciencia e Innovación PID2020-112540RB-C44; Ministerio de Economía y Competitividad TIN2016-75394-R; Junta de Andalucía P18-RT-106
TOMATE: A heuristic-based approach to extract data from HTML tables
Extracting data from user-friendly HTML tables is difficult because of their different layouts, formats, and encoding problems. In this article, we present a new proposal that first
applies several pre-processing heuristics to clean the tables, then performs functional analysis, and finally applies some post-processing heuristics to produce the output. Our most
important contribution is regarding functional analysis, which we address by projecting
the cells onto a high-dimensional feature space in which a standard clustering technique
is used to tell the meta-data cells apart from the data cells. We experimented with
two large repositories of real-world HTML tables and our results confirm that our proposal
can extract data from them with an F1 score of 89.50% in just 0.09 CPU seconds per table.
We confronted our proposal with several competitors and the statistical analysis confirmed
its superiority in terms of effectiveness, while it remains very competitive in terms of
efficiency.
Funding: Ministerio de Economía y Competitividad TIN2013-40848-R; Ministerio de Economía y Competitividad TIN2016-75394-R; Junta de Andalucía P18-RT-1060; Ministerio de Ciencia e Innovación PID2020-112540RB-C4
On the synthesis of metadata tags for HTML files
RDFa, JSON-LD, Microdata, and Microformats make it possible to endow the data in
HTML files with metadata tags that help software agents understand them.
Unfortunately, there are many HTML files that do not have any metadata tags,
which has motivated many authors to work on proposals to synthesize them.
But they have some problems: the authors either provide an overall picture of
their designs without too many details on the techniques behind the scenes or
focus on the techniques but do not describe the design of the software systems
that support them; many of them cannot deal with data that are encoded using
semistructured formats like forms, listings, or tables; and the few proposals that
can work on tables can deal with horizontal listings only. In this article, we
describe the design of a system that overcomes the previous limitations using a
novel embedding approach that has proven to outperform four state-of-the-art
techniques on a repository with randomly selected HTML files from 40 different sites. According to our experimental analysis, our proposal can achieve an
F1 score that outperforms the others by 10.14%; this difference was confirmed
to be statistically significant at the standard confidence level.
Funding: Junta de Andalucía P18-RT-1060; Ministerio de Economía y Competitividad TIN2013-40848-R; Ministerio de Economía y Competitividad TIN2016-75394-
Tocilizumab in refractory Caucasian Takayasu's arteritis: a multicenter study of 54 patients and literature review
Objective: To assess the efficacy and safety of tocilizumab (TCZ) in Caucasian patients with refractory Takayasu's arteritis (TAK) in clinical practice.
Methods: A multicenter study of Caucasian patients with refractory TAK who received TCZ. The outcome variables were remission, glucocorticoid-sparing effect, improvement in imaging techniques, and adverse events. A comparative study between patients who received TCZ as monotherapy (TCZMONO) and combined with conventional disease modifying anti-rheumatic drugs (cDMARDs) (TCZCOMBO) was performed.
Results: The study comprised 54 patients (46 women/8 men) with a median [interquartile range (IQR)] age of 42.0 (32.5-50.5) years. TCZ was started after a median (IQR) of 12.0 (3.0-31.5) months since TAK diagnosis. Remission was achieved in 12/54 (22.2%), 19/49 (38.8%), 23/44 (52.3%), and 27/36 (75%) patients at 1, 3, 6, and 12 months, respectively. The prednisone dose was reduced from 30.0 mg/day (12.5-50.0) to 5.0 (0.0-5.6) mg/day at 12 months. An improvement in imaging findings was reported in 28 (73.7%) patients after a median (IQR) of 9.0 (6.0-14.0) months. Twenty-three (42.6%) patients were on TCZMONO and 31 (57.4%) on TCZCOMBO: MTX (n = 28), cyclosporine A (n = 2), azathioprine (n = 1). Patients on TCZCOMBO were younger [38.0 (27.0-46.0) versus 45.0 (38.0-57.0)] years; difference (diff) [95% confidence interval (CI)] = -7.0 (-17.9, -0.56), with a trend to longer TAK duration [21.0 (6.0-38.0) versus 6.0 (1.0-23.0)] months; diff 95% CI = 15 (-8.9, 35.5), and higher C-reactive protein [2.4 (0.7-5.6) versus 1.3 (0.3-3.3)] mg/dl; diff 95% CI = 1.1 (-0.26, 2.99). Despite these differences, similar outcomes were observed in both groups (log rank p = 0.862). Relevant adverse events were reported in six (11.1%) patients, but only three developed severe events that required TCZ withdrawal.
Conclusion: TCZ in monotherapy, or combined with cDMARDs, is effective and safe in patients with refractory TAK of Caucasian origin.
Funding: This work was partially supported by RETICS Programs, RD08/0075 (RIER), RD12/0009/0013 and RD16/0012 from “Instituto de Salud Carlos III” (ISCIII) (Spain)
Treatment with tocilizumab or corticosteroids for COVID-19 patients with hyperinflammatory state: a multicentre cohort study (SAM-COVID-19)
Objectives: The objective of this study was to estimate the association between tocilizumab or corticosteroids and the risk of intubation or death in patients with coronavirus disease 19 (COVID-19) with a hyperinflammatory state according to clinical and laboratory parameters.
Methods: A cohort study was performed in 60 Spanish hospitals including 778 patients with COVID-19 and clinical and laboratory data indicative of a hyperinflammatory state. Treatment was mainly with tocilizumab, an intermediate-high dose of corticosteroids (IHDC), a pulse dose of corticosteroids (PDC), combination therapy, or no treatment. Primary outcome was intubation or death; follow-up was 21 days. Propensity score-adjusted estimations using Cox regression (logistic regression if needed) were calculated. Propensity scores were used as confounders, matching variables and for the inverse probability of treatment weights (IPTWs).
Results: In all, 88, 117, 78 and 151 patients treated with tocilizumab, IHDC, PDC, and combination therapy, respectively, were compared with 344 untreated patients. The primary endpoint occurred in 10 (11.4%), 27 (23.1%), 12 (15.4%), 40 (25.6%) and 69 (21.1%), respectively. The IPTW-based hazard ratios (odds ratio for combination therapy) for the primary endpoint were 0.32 (95%CI 0.22-0.47; p < 0.001) for tocilizumab, 0.82 (0.71-1.30; p 0.82) for IHDC, 0.61 (0.43-0.86; p 0.006) for PDC, and 1.17 (0.86-1.58; p 0.30) for combination therapy. Other applications of the propensity score provided similar results, but were not significant for PDC. Tocilizumab was also associated with lower hazard of death alone in IPTW analysis (0.07; 0.02-0.17; p < 0.001).
Conclusions: Tocilizumab might be useful in COVID-19 patients with a hyperinflammatory state and should be prioritized for randomized trials in this situation.